Scenario
To address the challenge of managing pedestrian congestion, I propose a predictive AI model that forecasts pedestrian flow at sensor locations throughout the city using historical pedestrian count data, scheduled event data, event locations, and weather conditions. The output will help optimise public safety planning, infrastructure usage, and event coordination.
User Story¶
As a city data scientist at the City of Melbourne,
I want to use an AI model that can accurately predict pedestrian flow patterns based on historical pedestrian sensor data, weather conditions, and scheduled event data,
so that I can provide insights to city planners and emergency services to proactively manage pedestrian movement and reduce overcrowding during high-traffic events.
Introduction¶
Understanding pedestrian movement patterns is essential for enhancing city planning, improving public safety, and ensuring smooth mobility during both regular days and large-scale events. The City of Melbourne collects a vast amount of data from pedestrian sensors located across the city, which, when combined with contextual data such as weather conditions and event schedules, can provide valuable insights into human mobility.
This project focuses on developing an AI-based model that utilises the datasets to accurately predict pedestrian flow. By predicting movement trends, the city can proactively manage crowds, allocate resources efficiently, and optimise infrastructure usage. This approach not only supports safer and more efficient urban environments but also contributes to a smarter, data-driven city management strategy.
*Dataset Links*
- Pedestrian counting link: https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-monthly-counts-per-hour/api/
- Venue location data link: https://data.melbourne.vic.gov.au/explore/dataset/venues-for-event-bookings/api/
- Weather station data link: https://data.melbourne.vic.gov.au/explore/dataset/meshed-sensor-type-1/information/
Importing Required Libraries¶
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import requests
import os
from functools import reduce
Importing Datasets¶
To build the AI-based pedestrian flow prediction model, we begin by importing the relevant datasets from the City of Melbourne Open Data API. The following datasets are used in this project:
- Pedestrian Counting System: Hourly pedestrian counts from sensors installed across the city.
- Event Locations: Data on the locations and timings of public events in Melbourne.
- Weather Data: Historical weather conditions including temperature, rainfall, and wind.
These datasets are accessed using API requests and loaded into pandas DataFrames for preprocessing and analysis.
def fetch_data(base_url, dataset, api_key, num_records=99, offset=0):
    all_records = []
    max_offset = 9900  # Maximum offset allowed for paginated requests
    while True:
        # Stop once the maximum offset is reached
        if offset > max_offset:
            break
        # Build the API request URL
        filters = f'{dataset}/records?limit={num_records}&offset={offset}'
        url = f'{base_url}{filters}&api_key={api_key}'
        # Send the request
        try:
            result = requests.get(url, timeout=10)
            result.raise_for_status()
            records = result.json().get('results')
        except requests.exceptions.RequestException as e:
            raise Exception(f"API request failed: {e}")
        if records is None:
            break
        all_records.extend(records)
        # A short page means we have reached the end of the dataset
        if len(records) < num_records:
            break
        # Advance the offset for the next page
        offset += num_records
    # Collect all records into a DataFrame
    df = pd.DataFrame(all_records)
    return df
# Retrieve API key from environment variable
API_KEY = os.environ.get('MELBOURNE_API_KEY')
BASE_URL = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
Importing Pedestrian Count Data¶
# Data set name
dataset_pedestrian_counting = 'pedestrian-counting-system-monthly-counts-per-hour'
# Fetch dataset
pedestrian_counting = fetch_data(BASE_URL, dataset_pedestrian_counting, API_KEY)
# Create a new column named latitude
pedestrian_counting['latitude'] = pedestrian_counting['location'].apply(lambda x: x['lat'] if isinstance(x, dict) else None)
# Create a new column named longitude
pedestrian_counting['longitude'] = pedestrian_counting['location'].apply(lambda x: x['lon'] if isinstance(x, dict) else None)
pedestrian_counting.head()
| | id | location_id | sensing_date | hourday | direction_1 | direction_2 | pedestriancount | sensor_name | location | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 862320241010 | 86 | 2024-10-10 | 23 | 0 | 5 | 5 | 574Qub_T | {'lon': 144.94908064, 'lat': -37.80309992} | -37.803100 | 144.949081 |
| 1 | 862020241223 | 86 | 2024-12-23 | 20 | 13 | 31 | 44 | 574Qub_T | {'lon': 144.94908064, 'lat': -37.80309992} | -37.803100 | 144.949081 |
| 2 | 24020240611 | 24 | 2024-06-11 | 0 | 47 | 50 | 97 | Col620_T | {'lon': 144.95449198, 'lat': -37.81887963} | -37.818880 | 144.954492 |
| 3 | 72120250705 | 72 | 2025-07-05 | 1 | 3 | 28 | 31 | ACMI_T | {'lon': 144.96872809, 'lat': -37.81726338} | -37.817263 | 144.968728 |
| 4 | 10720250707 | 10 | 2025-07-07 | 7 | 22 | 74 | 96 | BouHbr_T | {'lon': 144.94710545, 'lat': -37.81876474} | -37.818765 | 144.947105 |
Importing Weather Data¶
# Data set name
weather_stations = 'meshed-sensor-type-1'
# Fetch dataset
weather_stations_df = fetch_data(BASE_URL, weather_stations, API_KEY)
weather_stations_df.head()
| | device_id | time | rtc | battery | solarpanel | command | solar | precipitation | strikes | windspeed | winddirection | gustspeed | vapourpressure | atmosphericpressure | relativehumidity | airtemp | lat_long | sensor_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | atmos41-32fc | 2025-08-16T20:40:34+00:00 | 121366929 | 4.140 | 1.458 | 0 | 0 | 0 | 0 | 1.38 | 174.9 | 2.11 | 0.74 | 101.18 | 86 | 4.8 | None | None |
| 1 | atmos41-32fc | 2025-08-16T21:10:53+00:00 | 121368748 | 4.137 | 19.340 | 0 | 3 | 0 | 0 | 1.55 | 170.9 | 2.32 | 0.73 | 101.22 | 86 | 4.6 | None | None |
| 2 | atmos41-32fc | 2025-08-16T20:10:00+00:00 | 121365095 | 4.140 | 0.225 | 0 | 0 | 0 | 0 | 1.52 | 171.7 | 2.44 | 0.74 | 101.16 | 86 | 4.8 | None | None |
| 3 | atmos41-32fc | 2025-08-17T00:11:54+00:00 | 121379609 | 4.208 | 22.161 | 0 | 140 | 0 | 0 | 1.05 | 155.4 | 2.67 | 0.91 | 101.33 | 83 | 8.3 | None | None |
| 4 | atmos41-32fc | 2025-08-17T02:26:30+00:00 | 121387685 | 4.209 | 22.064 | 0 | 175 | 0 | 0 | 0.89 | 152.1 | 2.80 | 1.01 | 101.32 | 67 | 13.2 | None | None |
Importing Venue Location Data¶
venue_location = fetch_data(BASE_URL, 'venues-for-event-bookings', API_KEY)
# Create a new column named latitude
venue_location['latitude'] = venue_location['geo_point_2d'].apply(lambda x: x['lat'] if isinstance(x, dict) else None)
# Create a new column named longitude
venue_location['longitude'] = venue_location['geo_point_2d'].apply(lambda x: x['lon'] if isinstance(x, dict) else None)
venue_location.head()
| | geo_point_2d | geo_shape | prop_id | no_smoking | level_1_na | addresspt1 | event | full_name | addressp_1 | training | ... | sport | promotion | bookable | level_3_na | wedding | roadseg_id | sustainabi | level_2_na | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'lon': 144.98896595113547, 'lat': -37.8171274... | {'type': 'Feature', 'geometry': {'coordinates'... | 0 | N | Other Park Locations | 29.98397741 | Y | Other Park Locations: Weedon Reserve | 77 | N | ... | None | None | Y | None | N | 21951 | L | Weedon Reserve | -37.817127 | 144.988966 |
| 1 | {'lon': 144.98293613155118, 'lat': -37.8425742... | {'type': 'Feature', 'geometry': {'coordinates'... | 103756 | N | Fawkner Park | 0.0 | N | Fawkner Park: FP - Cordner_T1_T2_T3_F2_Ct3 | 0 | N | ... | Y | None | Y | None | N | 0 | M | FP - Cordner_T1_T2_T3_F2_Ct3 | -37.842574 | 144.982936 |
| 2 | {'lon': 144.9554621837315, 'lat': -37.78557214... | {'type': 'Feature', 'geometry': {'coordinates'... | 107426 | N | Royal Park | 0.0 | N | Royal Park: RP - Lawn 6_Walker East_Ct2 | 0 | N | ... | Y | None | Y | None | N | 0 | L | RP - Lawn 6_Walker East_Ct2 | -37.785572 | 144.955462 |
| 3 | {'lon': 144.98286097598913, 'lat': -37.8445902... | {'type': 'Feature', 'geometry': {'coordinates'... | 103756 | N | Fawkner Park | 0.0 | Y | Fawkner Park: FP - Lawn 21_T8_S3 | 0 | Y | ... | Y | None | Y | None | N | 0 | M | FP - Lawn 21_T8_S3 | -37.844590 | 144.982861 |
| 4 | {'lon': 144.97660241098285, 'lat': -37.8246014... | {'type': 'Feature', 'geometry': {'coordinates'... | 108615 | N | Kings Domain | 0.0 | Y | Kings Domain: KD - Lawn 9/Pillars of Wisdom | 0 | Y | ... | None | None | Y | None | Y | 0 | M | KD - Lawn 9/Pillars of Wisdom | -37.824601 | 144.976602 |
5 rows × 24 columns
Cleaning Datasets¶
Once the datasets are imported, the next step is to clean and prepare the data for analysis. This involves the following tasks:
- Handling Missing Values: Identifying and filling or removing missing or null entries to ensure consistency.
- Parsing Date and Time: Converting date and time columns into proper datetime formats for easier merging and time-based analysis.
- Extracting Relevant Columns: Dropping irrelevant or redundant columns and keeping only the necessary features.
- Standardising Column Names: Renaming columns for consistency across datasets.
- Filtering by Location or Date: Limiting data to specific sensors, areas, or timeframes relevant to the prediction task.
Clean and well-structured data ensures accurate model training and better predictive performance.
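The steps above can be sketched as a small helper function. This is a minimal, hypothetical sketch run on a tiny synthetic frame; `basic_clean` and its arguments are illustrative and not part of the notebook's actual pipeline:

```python
import pandas as pd

def basic_clean(df, date_col, keep_cols, rename_map=None):
    """Apply the cleaning steps listed above to a raw DataFrame."""
    out = df.copy()
    # Handling missing values: drop rows missing the date column
    out = out.dropna(subset=[date_col])
    # Parsing date and time into proper datetime objects
    out[date_col] = pd.to_datetime(out[date_col])
    # Extracting only the relevant columns
    out = out[keep_cols]
    # Standardising column names
    if rename_map:
        out = out.rename(columns=rename_map)
    return out

# Tiny synthetic frame standing in for a raw API response
raw = pd.DataFrame({
    'sensing_date': ['2024-10-10', None, '2024-12-23'],
    'pedestriancount': [5, 7, 44],
    'unused': ['a', 'b', 'c'],
})
clean = basic_clean(raw, 'sensing_date', ['sensing_date', 'pedestriancount'])
print(clean.shape)  # → (2, 2)
```

The row with a missing date is dropped and the irrelevant column removed, mirroring the per-dataset cleaning performed below.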
Data Quality and Summary Statistics Overview: Pedestrian Count Data¶
- Missing Data Check: I check the total number of missing values in each column using .isnull().sum(), which indicates whether any column contains missing data.
- Statistical Description: To obtain a statistical summary I use the .describe() method. This aids in understanding the distribution of the data and highlights any anomalies.
- Data Structure Summary: To obtain an overview of the dataset's structure and data types, I use .info().
print("Check for missing information in each column")
print(pedestrian_counting.isnull().sum())
print("Statistical Description")
print(pedestrian_counting.describe())
print("Data Structure Summary")
print(pedestrian_counting.info())
Check for missing information in each column
id 0
location_id 0
sensing_date 0
hourday 0
direction_1 0
direction_2 0
pedestriancount 0
sensor_name 0
location 0
latitude 0
longitude 0
dtype: int64
Statistical Description
id location_id hourday direction_1 direction_2 \
count 9.999000e+03 9999.000000 9999.000000 9999.000000 9999.000000
mean 4.575143e+11 68.522652 11.856286 192.428243 194.903390
std 4.813402e+11 46.616129 6.731115 295.791198 305.159582
min 1.020240e+09 1.000000 0.000000 0.000000 0.000000
25% 6.762025e+10 30.000000 6.000000 19.000000 20.000000
50% 2.519202e+11 58.000000 12.000000 84.000000 82.000000
75% 7.015203e+11 107.000000 18.000000 229.000000 236.500000
max 1.851920e+12 185.000000 23.000000 4930.000000 5589.000000
pedestriancount latitude longitude
count 9999.000000 9999.000000 9999.000000
mean 387.331633 -37.813151 144.960284
std 567.258374 0.006659 0.009135
min 0.000000 -37.825910 144.929734
25% 40.000000 -37.817940 144.954492
50% 172.000000 -37.814141 144.962997
75% 485.000000 -37.809993 144.966589
max 5926.000000 -37.794324 144.974677
Data Structure Summary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 9999 non-null int64
1 location_id 9999 non-null int64
2 sensing_date 9999 non-null object
3 hourday 9999 non-null int64
4 direction_1 9999 non-null int64
5 direction_2 9999 non-null int64
6 pedestriancount 9999 non-null int64
7 sensor_name 9999 non-null object
8 location 9999 non-null object
9 latitude 9999 non-null float64
10 longitude 9999 non-null float64
dtypes: float64(2), int64(6), object(3)
memory usage: 859.4+ KB
None
*Missing Data Check*
- The results show no missing data in any column, so there is no need to handle missing values.
*Statistical Description*
- Key highlights from the statistical summary:
  - hourday:
    - Ranges from 0 to 23, indicating hourly data spread across the day.
    - A mean of 11.86 and a median of 12 indicate an even spread.
  - pedestriancount:
    - A very wide range of values suggests the possibility of outliers in the dataset.
    - The mean (387.33) is well above the median (172), indicating the data is right-skewed.
  - direction_1 and direction_2:
    - These record the counts for each direction of travel (e.g. north or south) at the sensor.
*Data Structure Summary*
- The data type of each key feature is shown, providing a clear outline of the data we are dealing with.
pedestrian_counting.drop(columns=['id', 'sensor_name', 'location'], inplace=True)
pedestrian_counting.head()
| | location_id | sensing_date | hourday | direction_1 | direction_2 | pedestriancount | latitude | longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 86 | 2024-10-10 | 23 | 0 | 5 | 5 | -37.803100 | 144.949081 |
| 1 | 86 | 2024-12-23 | 20 | 13 | 31 | 44 | -37.803100 | 144.949081 |
| 2 | 24 | 2024-06-11 | 0 | 47 | 50 | 97 | -37.818880 | 144.954492 |
| 3 | 72 | 2025-07-05 | 1 | 3 | 28 | 31 | -37.817263 | 144.968728 |
| 4 | 10 | 2025-07-07 | 7 | 22 | 74 | 96 | -37.818765 | 144.947105 |
# Convert sensing_date to datetime so the format is uniform across datasets
pedestrian_counting['sensing_date'] = pd.to_datetime(pedestrian_counting['sensing_date'])
# Create a new column date_time by combining sensing_date and hourday
pedestrian_counting['date_time'] = pedestrian_counting['sensing_date'] + pd.to_timedelta(pedestrian_counting['hourday'], unit='h')
pedestrian_counting.head()
| | location_id | sensing_date | hourday | direction_1 | direction_2 | pedestriancount | latitude | longitude | date_time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 86 | 2024-10-10 | 23 | 0 | 5 | 5 | -37.803100 | 144.949081 | 2024-10-10 23:00:00 |
| 1 | 86 | 2024-12-23 | 20 | 13 | 31 | 44 | -37.803100 | 144.949081 | 2024-12-23 20:00:00 |
| 2 | 24 | 2024-06-11 | 0 | 47 | 50 | 97 | -37.818880 | 144.954492 | 2024-06-11 00:00:00 |
| 3 | 72 | 2025-07-05 | 1 | 3 | 28 | 31 | -37.817263 | 144.968728 | 2025-07-05 01:00:00 |
| 4 | 10 | 2025-07-07 | 7 | 22 | 74 | 96 | -37.818765 | 144.947105 | 2025-07-07 07:00:00 |
Visualising Pedestrian Count¶
A bar plot of total pedestrian count by hour allows us to identify the peak foot-traffic hours across the city.
# Group by hour and sum pedestrian counts
hourly_counts = pedestrian_counting.groupby('hourday')['pedestriancount'].sum()
# Plot the grouped data
plt.figure(figsize=(12, 6))
hourly_counts.plot(kind='bar', color='red')
# Add titles and labels
plt.title('Total Pedestrian Count by Hour')
plt.xlabel('Hour of Day')
plt.ylabel('Total Pedestrian Count')
plt.grid(True)
plt.tight_layout()
plt.show()
From the results we can clearly see that the hours of 13 and 17 are the two busiest periods of the day. I believe this reflects lunch breaks and the end of the working day for most individuals.
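The peak hours noted above can also be extracted programmatically with `Series.nlargest`. A small sketch on a synthetic stand-in for the `hourly_counts` Series built in the cell above:

```python
import pandas as pd

# Synthetic stand-in for the grouped Series: total count per hour of day
hourly_counts = pd.Series(
    [10, 5, 3, 80, 120, 95, 60],
    index=[0, 1, 2, 12, 13, 17, 18],
)
hourly_counts.index.name = 'hourday'

# nlargest returns the hours with the highest totals, sorted descending
peak_hours = hourly_counts.nlargest(2)
print(list(peak_hours.index))  # → [13, 17]
```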
Data Quality and Summary Statistics Overview: Weather Data¶
- Missing Data Check: I check the total number of missing values in each column using .isnull().sum(), which indicates whether any column contains missing data.
- Statistical Description: To obtain a statistical summary I use the .describe() method. This aids in understanding the distribution of the data and highlights any anomalies.
- Data Structure Summary: To obtain an overview of the dataset's structure and data types, I use .info().
print("Check for missing information in each column")
print(weather_stations_df.isnull().sum())
print("Statistical Description")
print(weather_stations_df.describe())
print("Data Structure Summary")
print(weather_stations_df.info())
Check for missing information in each column
device_id 0
time 0
rtc 0
battery 0
solarpanel 0
command 0
solar 0
precipitation 0
strikes 0
windspeed 0
winddirection 0
gustspeed 0
vapourpressure 0
atmosphericpressure 0
relativehumidity 0
airtemp 0
lat_long 9999
sensor_name 9999
dtype: int64
Statistical Description
rtc battery solarpanel command solar \
count 9.999000e+03 9999.000000 9999.000000 9999.0 9999.000000
mean 9.354407e+07 4.180553 10.578352 0.0 114.418142
std 1.671959e+07 0.025408 10.102549 0.0 198.805517
min 5.862781e+06 4.128000 0.000000 0.0 0.000000
25% 8.360097e+07 4.158000 0.233000 0.0 0.000000
50% 9.495296e+07 4.181000 11.596000 0.0 2.000000
75% 1.061945e+08 4.208000 21.411000 0.0 142.500000
max 1.213912e+08 4.232000 23.724000 0.0 970.000000
precipitation strikes windspeed winddirection gustspeed \
count 9999.0 9999.000000 9999.000000 9999.000000 9999.000000
mean 0.0 0.007401 -1.951308 156.042764 -0.297152
std 0.0 0.433108 173.197879 182.993800 173.232029
min 0.0 0.000000 -9999.000000 -9999.000000 -9999.000000
25% 0.0 0.000000 0.710000 144.600000 1.620000
50% 0.0 0.000000 0.960000 170.300000 2.410000
75% 0.0 0.000000 1.290000 183.600000 3.470000
max 0.0 42.000000 3.480000 359.900000 13.570000
vapourpressure atmosphericpressure relativehumidity airtemp
count 9999.000000 9999.000000 9999.000000 9999.000000
mean 1.336804 101.263879 72.445245 16.156536
std 0.375210 0.802949 14.029945 5.985070
min 0.580000 98.590000 18.000000 0.800000
25% 1.050000 100.740000 65.000000 12.000000
50% 1.260000 101.290000 75.000000 15.400000
75% 1.570000 101.830000 83.000000 19.200000
max 3.020000 103.600000 100.000000 40.300000
Data Structure Summary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 device_id 9999 non-null object
1 time 9999 non-null object
2 rtc 9999 non-null int64
3 battery 9999 non-null float64
4 solarpanel 9999 non-null float64
5 command 9999 non-null int64
6 solar 9999 non-null int64
7 precipitation 9999 non-null int64
8 strikes 9999 non-null int64
9 windspeed 9999 non-null float64
10 winddirection 9999 non-null float64
11 gustspeed 9999 non-null float64
12 vapourpressure 9999 non-null float64
13 atmosphericpressure 9999 non-null float64
14 relativehumidity 9999 non-null int64
15 airtemp 9999 non-null float64
16 lat_long 0 non-null object
17 sensor_name 0 non-null object
dtypes: float64(8), int64(6), object(4)
memory usage: 1.4+ MB
None
*Missing Data Check*
- The lat_long and sensor_name columns are entirely empty (9999 missing values each); all other columns have no missing data. These two columns will need to be dropped or handled.
*Statistical Description*
- Key highlights from the statistical summary:
  - precipitation: All values are zero, so this column holds no information and is a good candidate to be dropped.
  - solar: A very high maximum of 970 shows the data is strongly skewed, reflecting periods of high sunlight; the mean and median differ substantially.
  - strikes (lightning): A mean of 0.0074 and a maximum of 42 indicate very few lightning events.
  - windspeed, gustspeed, winddirection: All contain extreme sentinel values of -9999.0 that need to be removed to extract meaningful statistics; windspeed and gustspeed otherwise have similar averages.
  - relativehumidity: Ranges from 18 to 100.
  - airtemp: Ranges from 0.8°C to 40.3°C, consistent with normal weather patterns for the city of Melbourne.
*Data Structure Summary*
- The data type of each key feature is shown, providing a clear outline of the data we are dealing with.
- The time column needs to be converted to datetime format for time-based analysis.
- lat_long should also be split into latitude and longitude to match the structure of the previous dataset.
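The -9999 sentinel readings flagged above can be converted to proper missing values so that summary statistics ignore them. A minimal sketch on a few synthetic rows (the column names follow the weather dataset; the actual filtering in this notebook is done later, before plotting):

```python
import numpy as np
import pandas as pd

# Synthetic rows standing in for the weather data, including -9999 sentinels
df = pd.DataFrame({
    'windspeed': [1.38, -9999.0, 0.96],
    'gustspeed': [2.11, -9999.0, 2.41],
    'winddirection': [174.9, 170.9, -9999.0],
})

# Replace the sensor's -9999 error sentinel with NaN so summary
# statistics are computed over valid readings only
sentinel_cols = ['windspeed', 'gustspeed', 'winddirection']
df[sentinel_cols] = df[sentinel_cols].replace(-9999.0, np.nan)

print(round(df['windspeed'].mean(), 2))  # → 1.17
```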
# Create a new column named latitude
weather_stations_df['latitude'] = weather_stations_df['lat_long'].apply(lambda x: x['lat'] if isinstance(x, dict) else None)
# Create a new column named longitude
weather_stations_df['longitude'] = weather_stations_df['lat_long'].apply(lambda x: x['lon'] if isinstance(x, dict) else None)
weather_stations_df.head()
| | device_id | time | rtc | battery | solarpanel | command | solar | precipitation | strikes | windspeed | winddirection | gustspeed | vapourpressure | atmosphericpressure | relativehumidity | airtemp | lat_long | sensor_name | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | atmos41-32fc | 2025-08-16T20:40:34+00:00 | 121366929 | 4.140 | 1.458 | 0 | 0 | 0 | 0 | 1.38 | 174.9 | 2.11 | 0.74 | 101.18 | 86 | 4.8 | None | None | None | None |
| 1 | atmos41-32fc | 2025-08-16T21:10:53+00:00 | 121368748 | 4.137 | 19.340 | 0 | 3 | 0 | 0 | 1.55 | 170.9 | 2.32 | 0.73 | 101.22 | 86 | 4.6 | None | None | None | None |
| 2 | atmos41-32fc | 2025-08-16T20:10:00+00:00 | 121365095 | 4.140 | 0.225 | 0 | 0 | 0 | 0 | 1.52 | 171.7 | 2.44 | 0.74 | 101.16 | 86 | 4.8 | None | None | None | None |
| 3 | atmos41-32fc | 2025-08-17T00:11:54+00:00 | 121379609 | 4.208 | 22.161 | 0 | 140 | 0 | 0 | 1.05 | 155.4 | 2.67 | 0.91 | 101.33 | 83 | 8.3 | None | None | None | None |
| 4 | atmos41-32fc | 2025-08-17T02:26:30+00:00 | 121387685 | 4.209 | 22.064 | 0 | 175 | 0 | 0 | 0.89 | 152.1 | 2.80 | 1.01 | 101.32 | 67 | 13.2 | None | None | None | None |
# Number of unique sensor names
print(weather_stations_df['sensor_name'].nunique())
# Number of unique locations
print(weather_stations_df['latitude'].nunique())
print(weather_stations_df['longitude'].nunique())
0
0
0
The results show zero unique values for each of these columns: sensor_name, latitude, and longitude are entirely null in this dataset, so there is no variance to exploit. Since the data comes from a single device in a single location, sensor_name and device_id are of no value to the AI pedestrian flow prediction.
Dropping Unnecessary Columns¶
To clean the dataset, redundant columns are dropped:
- device_id: A unique identifier for the weather station device. Since there is only one device, it contributes no valuable information.
- sensor_name: Another label for the device. Entirely null, so it does not contribute to the AI pedestrian flow predictor.
- lat_long: The location has been separated into latitude and longitude columns, so this column is no longer necessary; there is also only one location in use.
weather_stations_df.drop(columns=['device_id', 'sensor_name', 'lat_long'], inplace=True)
weather_stations_df['date_time'] = pd.to_datetime(weather_stations_df['time'])
weather_stations_df.head()
| | time | rtc | battery | solarpanel | command | solar | precipitation | strikes | windspeed | winddirection | gustspeed | vapourpressure | atmosphericpressure | relativehumidity | airtemp | latitude | longitude | date_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-08-16T20:40:34+00:00 | 121366929 | 4.140 | 1.458 | 0 | 0 | 0 | 0 | 1.38 | 174.9 | 2.11 | 0.74 | 101.18 | 86 | 4.8 | None | None | 2025-08-16 20:40:34+00:00 |
| 1 | 2025-08-16T21:10:53+00:00 | 121368748 | 4.137 | 19.340 | 0 | 3 | 0 | 0 | 1.55 | 170.9 | 2.32 | 0.73 | 101.22 | 86 | 4.6 | None | None | 2025-08-16 21:10:53+00:00 |
| 2 | 2025-08-16T20:10:00+00:00 | 121365095 | 4.140 | 0.225 | 0 | 0 | 0 | 0 | 1.52 | 171.7 | 2.44 | 0.74 | 101.16 | 86 | 4.8 | None | None | 2025-08-16 20:10:00+00:00 |
| 3 | 2025-08-17T00:11:54+00:00 | 121379609 | 4.208 | 22.161 | 0 | 140 | 0 | 0 | 1.05 | 155.4 | 2.67 | 0.91 | 101.33 | 83 | 8.3 | None | None | 2025-08-17 00:11:54+00:00 |
| 4 | 2025-08-17T02:26:30+00:00 | 121387685 | 4.209 | 22.064 | 0 | 175 | 0 | 0 | 0.89 | 152.1 | 2.80 | 1.01 | 101.32 | 67 | 13.2 | None | None | 2025-08-17 02:26:30+00:00 |
Visualising the Weather Station Data¶
A line plot is used to observe air temperature over time.
# Convert time to datetime format
weather_stations_df['time'] = pd.to_datetime(weather_stations_df['time'])
# Group by time and take the mean air temperature
temperature_time = weather_stations_df.groupby('time')['airtemp'].mean()
# Visualise as a line graph
plt.figure(figsize=(12, 6))
temperature_time.plot(color='blue')
plt.title('Air Temperature Over Time')
plt.xlabel('Time')
plt.ylabel('Air Temperature (°C)')
plt.grid(True)
plt.tight_layout()
plt.show()
*Air Temperature Over Time*
Analysing the results, it is clear that there is far less data before 2024-01: the number of data points is low, indicating the data collection frequency during this period was lower than expected. However, the general trend is consistent with the higher-frequency data, with temperature rising in summer and falling in winter.
# 'time' was converted to datetime above; repeating the conversion is harmless
weather_stations_df['time'] = pd.to_datetime(weather_stations_df['time'])
# Group by time and take the mean atmospheric pressure
pressure_time = weather_stations_df.groupby('time')['atmosphericpressure'].mean()
# Plotting
plt.figure(figsize=(12, 6))
pressure_time.plot(color='green')
plt.title('Atmospheric Pressure Over Time')
plt.xlabel('Time')
plt.ylabel('Atmospheric Pressure (kPa)')
plt.grid(True)
plt.tight_layout()
plt.show()
*Atmospheric Pressure Over Time*
Analysing the results, it is clear that there is far less data before 2024-01: the number of data points is low, indicating the data collection frequency during this period was lower than expected. However, the general trend is consistent with the more recent data, similar to the pattern seen in air temperature.
Uniform Date Time¶
pedestrian_hourly is the pedestrian counting dataset reduced to date_time and the total pedestrian count per hour. It is used to visualise the relationship between pedestrian count and the various weather metrics.
Ensure date_time is uniform in both datasets so they can be merged.
pedestrian_hourly = pedestrian_counting.groupby('date_time')['pedestriancount'].sum().reset_index()
pedestrian_hourly.columns = ['date_time', 'pedestrian_count']
pedestrian_hourly['date_time'] = pedestrian_hourly['date_time'].dt.tz_localize(None).dt.floor('h')
weather_stations_df['date_time'] = weather_stations_df['date_time'].dt.tz_localize(None).dt.floor('h')
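A quick check that the flooring step aligns timestamps from the two sources. This is a sketch on single synthetic rows, not the notebook's actual data: the weather timestamp carries a timezone offset and sub-hour precision, while the pedestrian timestamp is already on the hour:

```python
import pandas as pd

# Synthetic one-row stand-ins for the two datasets
ped = pd.DataFrame({'date_time': pd.to_datetime(['2025-08-16 20:00:00']),
                    'pedestrian_count': [120]})
wx = pd.DataFrame({'date_time': pd.to_datetime(['2025-08-16T20:40:34+00:00']),
                   'airtemp': [4.8]})

# Strip timezone info and floor to the hour so both sides share a key
wx['date_time'] = wx['date_time'].dt.tz_localize(None).dt.floor('h')

merged = pd.merge(ped, wx, on='date_time', how='inner')
print(len(merged))  # → 1
```

Without the `tz_localize(None)` and `floor('h')` steps the merge keys would never match, and the inner join would be empty.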
Visualising weather station features against Pedestrian count¶
All key features are visually examined against pedestrian count. This is done by merging the datasets into combined_df.
# Create a combined dataset of pedestrian and weather data, merged on date_time
combined_df = pd.merge(pedestrian_hourly, weather_stations_df, on='date_time', how='inner')
# Scatterplot air temperature vs pedestrian Count
plt.figure(figsize=(10, 6))
sns.scatterplot(data=combined_df, x='airtemp', y='pedestrian_count', alpha=0.5, color = 'red')
plt.title("Air Temperature v. Pedestrian Count")
plt.xlabel("Air Temperature (°C)")
plt.ylabel("Pedestrian Count")
plt.tight_layout()
plt.show()
*Air Temperature vs Pedestrian Count*
Analysing the scatterplot, the data is grouped around the 10°C to 25°C range. Pedestrian counts cluster around 0-20, with a smaller distribution of higher counts whose peak values also fall within the 10°C to 25°C range. This indicates that moderate temperatures result in increased pedestrian activity, while extreme temperatures (below 5°C and above 30°C) show a clear negative correlation with pedestrian count. The relationship appears quadratic: both low and high temperatures have lower pedestrian counts, and the highest counts occur at moderate temperatures.
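The quadratic shape described above can be probed with a simple degree-2 polynomial fit. A sketch on synthetic data standing in for `combined_df` (the peak temperature and coefficients here are illustrative, not results from the real dataset):

```python
import numpy as np

# Synthetic data standing in for combined_df: counts peak near 18°C
rng = np.random.default_rng(1)
airtemp = rng.uniform(0, 35, 300)
counts = 500 - 2.0 * (airtemp - 18) ** 2 + rng.normal(0, 20, 300)

# Fit count = a*t^2 + b*t + c; a < 0 indicates an inverted-U relationship
a, b, c = np.polyfit(airtemp, counts, deg=2)
print(a < 0)  # → True
print(-b / (2 * a))  # estimated temperature of maximum predicted count
```

A negative leading coefficient confirms the inverted-U pattern; the vertex `-b / (2a)` estimates the temperature at which pedestrian activity peaks.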
# Scatterplot atmospheric pressure vs pedestrian count
plt.figure(figsize=(10, 6))
sns.scatterplot(data=combined_df, x='atmosphericpressure', y='pedestrian_count', alpha=0.5, color = 'green')
plt.title("Atmospheric pressure vs Pedestrian Count")
plt.xlabel("Atmospheric pressure (kPa)")
plt.ylabel("Pedestrian Count")
plt.tight_layout()
plt.show()
*Atmospheric Pressure vs Pedestrian Count*
Analysing the scatterplot of atmospheric pressure vs pedestrian count, the values range from about 99 to 103 kPa. No clear linear relationship can be distinguished, indicating that atmospheric pressure alone is not a strong predictor; it may be best to exclude it from the model, as it could introduce noise into the model's performance.
# Scatterplot windspeed vs pedestrian count
# Filter out the -9999 sentinel values before plotting
combined_df = combined_df[combined_df['windspeed'] > -100]
plt.figure(figsize=(10, 6))
sns.scatterplot(data=combined_df, x='windspeed', y='pedestrian_count', alpha=0.5)
plt.title("Wind speed vs. Pedestrian Count")
plt.xlabel("Wind (m/s)")
plt.ylabel("Pedestrian Count")
plt.tight_layout()
plt.show()
*Wind Speed vs Pedestrian Count*
Analysing the scatterplot, the data is grouped between 0.5 m/s and 1.5 m/s, where peak pedestrian traffic also occurs, with a significant decrease in pedestrian count as wind speed increases. People tend to prefer walking in calmer weather, and high wind speeds deter foot traffic. There is a clear negative relationship between the two features, indicating wind speed will be useful for predictive modelling.
#Scatterplot vapourpressure vs pedestrian count
plt.figure(figsize=(10, 6))
sns.scatterplot(data=combined_df, x='vapourpressure', y='pedestrian_count', alpha=0.5, color = 'purple')
plt.title("Vapour Pressure vs. Pedestrian Count")
plt.xlabel("Vapour Pressure (kPa)")
plt.ylabel("Pedestrian Count")
plt.tight_layout()
plt.show()
*Vapour Pressure vs Pedestrian Count*
Analysing the scatterplot, there does not seem to be a strong relationship between vapour pressure and foot traffic. We do see a slight decrease in foot traffic as vapour pressure increases, suggesting fewer people walk in more humid conditions; however, the correlation is not strong enough on its own, though it may help model performance when combined with other features.
Correlation matrix¶
A correlation matrix is an effective way to represent the linear relationships between features in a dataset. It is an essential tool in feature engineering, as it helps identify how strongly features are correlated with each other, find features with strong correlation to the target variable, and, importantly, detect multicollinearity, where two features are highly similar and may reduce the performance of the ML model.
- Combine the features into a single dataset: all relevant features are moved into one DataFrame.
- Implement and visualise the correlation matrix.
combined_data = combined_df.drop(columns=['date_time', 'time', 'command', 'precipitation', 'longitude','latitude'])
# Build the correlation matrix
c_matrix = combined_data.corr()
# visualisation of the matrix
plt.figure(figsize=(10, 6))
sns.heatmap(c_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix of Weather Features and Pedestrian Count")
plt.tight_layout()
plt.show()
*Correlation matrix*
The correlation matrix lets us determine the linear relationship between each pair of features. By analysing the values we can determine which features have a high correlation with pedestrian count and which do not, and how similar features are to each other. Values range from -1 to 1: a value near 1 indicates a positive linear relationship, a value near -1 a negative linear relationship, and a value near 0 indicates no linear correlation (any relationship is not captured linearly). Employing the correlation matrix gives a visual representation of how each feature correlates with the target variable, pedestrian count.
Most Important Features
Based on correlation value and practical relevance, the key features to consider are:
- relativehumidity
- windspeed
- airtemp
- gustspeed
- vapourpressure
- atmosphericpressure
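The shortlist above can also be derived programmatically by ranking absolute correlations with the target and flagging near-duplicate feature pairs. A minimal sketch on synthetic stand-in data (the column names mirror the real features; the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the combined weather/pedestrian data
rng = np.random.default_rng(0)
n = 200
airtemp = rng.normal(18, 5, n)
windspeed = rng.normal(2, 0.5, n)
gustspeed = windspeed * 2 + rng.normal(0, 0.1, n)      # nearly collinear with windspeed
pedestriancount = 50 * airtemp - 30 * windspeed + rng.normal(0, 40, n)

df = pd.DataFrame({'airtemp': airtemp, 'windspeed': windspeed,
                   'gustspeed': gustspeed, 'pedestriancount': pedestriancount})
c = df.corr()

# Rank features by absolute correlation with the target
ranked = c['pedestriancount'].drop('pedestriancount').abs().sort_values(ascending=False)
print(ranked)

# Flag highly similar feature pairs (multicollinearity candidates)
pairs = [(a, b, round(c.loc[a, b], 2))
         for i, a in enumerate(c.columns) for b in c.columns[i + 1:]
         if 'pedestriancount' not in (a, b) and abs(c.loc[a, b]) > 0.9]
print(pairs)
```

Applied to the real c_matrix computed above, the same two steps would rank the weather features against pedestriancount and flag any highly similar pairs.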
Data Quality and Summary Statistics Overview: Venue Data¶
- Missing data check: I check the total number of missing values in each column using .isnull().sum(), which indicates whether there is missing data and in which columns.
- Statistical description: To obtain a statistical summary I use the .describe() method. This helps in understanding the distribution of the data and highlights any anomalies.
- Data structure summary: To obtain an overview of the dataset's structure (column types and non-null counts) I use .info().
print("Check for missing information in each column")
print(venue_location.isnull().sum())
print("Statistical Description")
print(venue_location.describe())
print("Data Structure Summary")
print(venue_location.info())
Check for missing information in each column
geo_point_2d 0
geo_shape 0
prop_id 0
no_smoking 0
level_1_na 0
addresspt1 0
event 0
full_name 0
addressp_1 0
training 0
dog_prohib 0
dog_off_le 0
venue_recn 0
addresspt 0
sport 139
promotion 206
bookable 0
level_3_na 192
wedding 0
roadseg_id 0
sustainabi 18
level_2_na 0
latitude 0
longitude 0
dtype: int64
Statistical Description
latitude longitude
count 206.000000 206.000000
mean -37.810742 144.963473
std 0.017024 0.016949
min -37.844590 144.913667
25% -37.821651 144.954027
50% -37.812328 144.968012
75% -37.798513 144.976381
max -37.778101 144.989091
Data Structure Summary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206 entries, 0 to 205
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 geo_point_2d 206 non-null object
1 geo_shape 206 non-null object
2 prop_id 206 non-null object
3 no_smoking 206 non-null object
4 level_1_na 206 non-null object
5 addresspt1 206 non-null object
6 event 206 non-null object
7 full_name 206 non-null object
8 addressp_1 206 non-null object
9 training 206 non-null object
10 dog_prohib 206 non-null object
11 dog_off_le 206 non-null object
12 venue_recn 206 non-null object
13 addresspt 206 non-null object
14 sport 67 non-null object
15 promotion 0 non-null object
16 bookable 206 non-null object
17 level_3_na 14 non-null object
18 wedding 206 non-null object
19 roadseg_id 206 non-null object
20 sustainabi 188 non-null object
21 level_2_na 206 non-null object
22 latitude 206 non-null float64
23 longitude 206 non-null float64
dtypes: float64(2), object(22)
memory usage: 38.8+ KB
None
*Missing data Check*
- We can see from the results obtained there are multiple columns with missing data.
- Columns with missing data:
- sport, with 139 missing values
- promotion, with 206 missing values
- level_3_na, with 192 missing values
- sustainabi, with 18 missing values
- The remaining columns all have 0 missing values
*Data Structure Summary*
- Most columns are of type object, indicating many categorical features in this dataset that will need to be converted to numerical values.
- latitude and longitude are of type float and are centred around Melbourne.
Converting categorical data to numerical.¶
The for loop iterates over the listed features in the venue_location dataset. Given that the data in each column is categorical (Y or N), it converts the values to 1 and 0 respectively. This is important because ML models cannot work with categorical data directly, so it must be converted to numerical form.
#converting categorical features in venue location to numerical for Y/N to 1/0
for col in ['event', 'wedding', 'sport', 'bookable', 'dog_prohib','training', 'dog_off_le', 'no_smoking']:
venue_location[col] = venue_location[col].apply(lambda x: 1 if str(x).strip().upper() == 'Y' else 0)
venue_location.head()
| geo_point_2d | geo_shape | prop_id | no_smoking | level_1_na | addresspt1 | event | full_name | addressp_1 | training | ... | sport | promotion | bookable | level_3_na | wedding | roadseg_id | sustainabi | level_2_na | latitude | longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'lon': 144.98896595113547, 'lat': -37.8171274... | {'type': 'Feature', 'geometry': {'coordinates'... | 0 | 0 | Other Park Locations | 29.98397741 | 1 | Other Park Locations: Weedon Reserve | 77 | 0 | ... | 0 | None | 1 | None | 0 | 21951 | L | Weedon Reserve | -37.817127 | 144.988966 |
| 1 | {'lon': 144.98293613155118, 'lat': -37.8425742... | {'type': 'Feature', 'geometry': {'coordinates'... | 103756 | 0 | Fawkner Park | 0.0 | 0 | Fawkner Park: FP - Cordner_T1_T2_T3_F2_Ct3 | 0 | 0 | ... | 1 | None | 1 | None | 0 | 0 | M | FP - Cordner_T1_T2_T3_F2_Ct3 | -37.842574 | 144.982936 |
| 2 | {'lon': 144.9554621837315, 'lat': -37.78557214... | {'type': 'Feature', 'geometry': {'coordinates'... | 107426 | 0 | Royal Park | 0.0 | 0 | Royal Park: RP - Lawn 6_Walker East_Ct2 | 0 | 0 | ... | 1 | None | 1 | None | 0 | 0 | L | RP - Lawn 6_Walker East_Ct2 | -37.785572 | 144.955462 |
| 3 | {'lon': 144.98286097598913, 'lat': -37.8445902... | {'type': 'Feature', 'geometry': {'coordinates'... | 103756 | 0 | Fawkner Park | 0.0 | 1 | Fawkner Park: FP - Lawn 21_T8_S3 | 0 | 1 | ... | 1 | None | 1 | None | 0 | 0 | M | FP - Lawn 21_T8_S3 | -37.844590 | 144.982861 |
| 4 | {'lon': 144.97660241098285, 'lat': -37.8246014... | {'type': 'Feature', 'geometry': {'coordinates'... | 108615 | 0 | Kings Domain | 0.0 | 1 | Kings Domain: KD - Lawn 9/Pillars of Wisdom | 0 | 1 | ... | 0 | None | 1 | None | 1 | 0 | M | KD - Lawn 9/Pillars of Wisdom | -37.824601 | 144.976602 |
5 rows × 24 columns
Dropping Unnecessary Columns¶
To clean the dataset, redundant columns are dropped:
- geo_point_2d: redundant with latitude and longitude
- geo_shape: complex geometry not usable in standard ML pipelines
- level_1_na: unnecessary data
- addresspt1: textual address; not informative for ML
- addresspt: unnecessary data
- level_3_na: unnecessary data (mostly missing)
- roadseg_id: categorical ID with no predictive signal unless joined with an external dataset
- sustainabi: no significance in capturing pedestrian count
- level_2_na: unnecessary data
print(venue_location.columns)
venue_location.drop(['geo_point_2d','geo_shape', 'level_1_na', 'addresspt1', 'addresspt','level_3_na', 'roadseg_id', 'sustainabi','level_2_na'], axis=1, inplace=True)
venue_location.head()
Index(['geo_point_2d', 'geo_shape', 'prop_id', 'no_smoking', 'level_1_na',
'addresspt1', 'event', 'full_name', 'addressp_1', 'training',
'dog_prohib', 'dog_off_le', 'venue_recn', 'addresspt', 'sport',
'promotion', 'bookable', 'level_3_na', 'wedding', 'roadseg_id',
'sustainabi', 'level_2_na', 'latitude', 'longitude'],
dtype='object')
| prop_id | no_smoking | event | full_name | addressp_1 | training | dog_prohib | dog_off_le | venue_recn | sport | promotion | bookable | wedding | latitude | longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | Other Park Locations: Weedon Reserve | 77 | 0 | 0 | 0 | 402 | 0 | None | 1 | 0 | -37.817127 | 144.988966 |
| 1 | 103756 | 0 | 0 | Fawkner Park: FP - Cordner_T1_T2_T3_F2_Ct3 | 0 | 0 | 0 | 0 | 1354 | 1 | None | 1 | 0 | -37.842574 | 144.982936 |
| 2 | 107426 | 0 | 0 | Royal Park: RP - Lawn 6_Walker East_Ct2 | 0 | 0 | 0 | 0 | 277 | 1 | None | 1 | 0 | -37.785572 | 144.955462 |
| 3 | 103756 | 0 | 1 | Fawkner Park: FP - Lawn 21_T8_S3 | 0 | 1 | 0 | 1 | 67 | 1 | None | 1 | 0 | -37.844590 | 144.982861 |
| 4 | 108615 | 0 | 1 | Kings Domain: KD - Lawn 9/Pillars of Wisdom | 0 | 1 | 0 | 0 | 184 | 0 | None | 1 | 1 | -37.824601 | 144.976602 |
Geospatial Visualisation of Venue Locations and Pedestrian Sensor Locations¶
This folium map displays markers for all of the venue locations across Melbourne, enabling us to see the spread of venues. The map also shows the pedestrian count sensor locations, giving a clear representation of how the venues and pedestrian sensors are distributed relative to each other, aiding a better understanding of the problem.
import folium
#centre the map on Melbourne
m = folium.Map(location=[-37.8136, 144.9631], zoom_start= 14)
#iterate over the rows in venue location
for _, row in venue_location.iterrows():
if pd.notnull(row['latitude']) and pd.notnull(row['longitude']):
folium.CircleMarker(
location=[row['latitude'], row['longitude']],
radius=3,
color='blue',
fill=True
).add_to(m)
for _, row in pedestrian_counting.iterrows():
if pd.notnull(row['latitude']) and pd.notnull(row['longitude']):
folium.CircleMarker(
location=[row['latitude'], row['longitude']],
radius=3,
color='yellow',
fill=True
).add_to(m)
# Add legend
legend_html = """
<div style="position: fixed;
bottom: 50px; left: 50px; width: 150px; height: 100px;
background-color: rgba(255, 255, 255, 0.8); border:2px solid grey; z-index:1000; font-size:12px;
padding: 10px;">
<b>Legend</b><br>
<i style="background:blue; width:10px; height:10px; display:inline-block; border-radius:50%;"></i> Venue Location<br>
<i style="background:yellow; width:10px; height:10px; display:inline-block; border-radius:50%;"></i> Pedestrian Counter<br>
</div>
"""
m.get_root().html.add_child(folium.Element(legend_html))
m
Visualisation Summary of Spatial Data¶
The output generated is an interactive map of the Melbourne CBD featuring:
- Yellow dots: Pedestrian counter
- Blue dots: Venue Locations
From the map, several key insights can be drawn:
- Pedestrian sensors are installed only at specific points, which means that pedestrian traffic data is not uniformly available city-wide.
- Venue locations tend to be spread around the city, and only a few have a sensor within close distance, which makes it difficult to accurately determine the pedestrian count at a venue.
- This limitation indicates the need to use various other metrics to capture relationships with pedestrian count and provide a reliable output.
- Visualising this spatial data has proven highly informative; it provides critical context for the effective design and implementation of the AI pedestrian flow use case.
Calculating Distance from Venue to Pedestrian Sensor¶
GeoPy is a Python library for geographical calculations; here it is used to compute the distance between each venue location and each pedestrian sensor. This will be a key feature in estimating the pedestrian count at a venue, as the venue location data does not contain pedestrian sensors at the venues themselves or historical time data.
from geopy.distance import geodesic
# store the distance values and the other essential features
distance_rows = []
# outer loop: iterates through each venue in venue_location and extracts its attributes
for _, venue in venue_location.iterrows():
venue_loc = (venue['latitude'], venue['longitude'])
venue_id = venue['prop_id']
venue_name = venue['full_name']
sport = venue['sport']
wedding = venue['wedding']
bookable = venue['bookable']
no_smoking = venue['no_smoking']
event = venue['event']
training = venue['training']
dog_prohib = venue['dog_prohib']
    # inner loop: iterates through the pedestrian sensors and calculates the distance
for _, sensor in pedestrian_counting.iterrows():
sensor_loc = (sensor['latitude'], sensor['longitude'])
sensor_id = sensor['location_id']
        # The distance is calculated using geodesic between the venue location and sensor location.
distance = geodesic(venue_loc, sensor_loc).meters
# the data is then appended to the distance_rows list
distance_rows.append({'venue_id': venue_id, 'venue_name': venue_name, 'sensor_id': sensor_id, 'distance_venue': distance,
'sport': sport, 'wedding': wedding, 'bookable':bookable, 'no_smoking':no_smoking, 'event': event,
'training': training, 'dog_prohib': dog_prohib} )
# distance_rows is converted into a pandas dataframe named venue_sensor_distance
venue_sensor_distance = pd.DataFrame(distance_rows)
venue_sensor_distance.head()
| venue_id | venue_name | sensor_id | distance_venue | sport | wedding | bookable | no_smoking | event | training | dog_prohib | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Other Park Locations: Weedon Reserve | 86 | 3841.864082 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | Other Park Locations: Weedon Reserve | 86 | 3841.864082 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | Other Park Locations: Weedon Reserve | 24 | 3041.625597 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | Other Park Locations: Weedon Reserve | 72 | 1782.008421 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 0 | Other Park Locations: Weedon Reserve | 10 | 3690.259064 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
Calculating the distance from every venue to every sensor has significantly increased the size of the dataframe. Much of this information is not relevant, as we are only concerned with the readings from the pedestrian sensor closest to each venue. I therefore group the data and keep only the shortest distance from each venue to its nearest sensor.
Nearest Sensor Calculation¶
The data is grouped by venue_id, then the row with the shortest distance is selected using idxmin().
# Select the row with the minimum distance for each venue_id
nearest_sensors = venue_sensor_distance.loc[venue_sensor_distance.groupby('venue_id')['distance_venue'].idxmin()].reset_index(drop=True)
nearest_sensors.head()
| venue_id | venue_name | sensor_id | distance_venue | sport | wedding | bookable | no_smoking | event | training | dog_prohib | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Promotional Spaces: Melbourne Central Station... | 66 | 62.590427 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 100385 | Alexandra Gardens: AG - Henley Landing | 136 | 347.445537 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 100514 | Other Park Locations: North Melbourne Recreat... | 180 | 696.844279 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 100894 | Other Park Locations: Bayswater Road Park | 85 | 528.924036 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 4 | 101101 | Alexandra Gardens: AG - Lawn2/Peppercorn Lawn | 29 | 60.177471 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
When analysing the shape of the dataframe we can see that it has decreased significantly, from 2,059,794 rows to 80.
print(pedestrian_counting.columns)
Index(['location_id', 'sensing_date', 'hourday', 'direction_1', 'direction_2',
'pedestriancount', 'latitude', 'longitude', 'date_time'],
dtype='object')
Merging Datasets¶
The final dataset is created by merging the pedestrian count, nearest_sensors (venue) and weather data. First, pedestrian count is merged with nearest_sensors (left on location_id, right on sensor_id) using an inner join. Next, model_df is merged with the weather station data on date_time, also with an inner join. The purpose is to combine all the datasets so the key features can be used by the predictive models.
# Merge pedestrian count with nearest sensor left on location id and right on sensor id with inner join.
model_df = pd.merge(
pedestrian_counting,
nearest_sensors,
left_on='location_id', right_on='sensor_id',
how='inner'
)
# model_df is merged with weather station data on date_time with inner join
model_df = pd.merge(
model_df,
weather_stations_df,
on='date_time',
how='inner'
)
model_df.head()
| location_id | sensing_date | hourday | direction_1 | direction_2 | pedestriancount | latitude_x | longitude_x | date_time | venue_id | ... | strikes | windspeed | winddirection | gustspeed | vapourpressure | atmosphericpressure | relativehumidity | airtemp | latitude_y | longitude_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130 | 2025-03-03 | 14 | 47 | 32 | 79 | -37.820464 | 144.941268 | 2025-03-03 14:00:00 | 604120 | ... | 0 | 0.93 | 181.5 | 2.41 | 1.34 | 101.60 | 77 | 15.4 | None | None |
| 1 | 130 | 2025-03-03 | 14 | 47 | 32 | 79 | -37.820464 | 144.941268 | 2025-03-03 14:00:00 | 611697 | ... | 0 | 0.93 | 181.5 | 2.41 | 1.34 | 101.60 | 77 | 15.4 | None | None |
| 2 | 130 | 2025-03-03 | 14 | 47 | 32 | 79 | -37.820464 | 144.941268 | 2025-03-03 14:00:00 | 651910 | ... | 0 | 0.93 | 181.5 | 2.41 | 1.34 | 101.60 | 77 | 15.4 | None | None |
| 3 | 130 | 2025-03-05 | 10 | 58 | 22 | 80 | -37.820464 | 144.941268 | 2025-03-05 10:00:00 | 604120 | ... | 0 | 1.30 | 154.4 | 3.01 | 2.06 | 101.69 | 86 | 20.4 | None | None |
| 4 | 130 | 2025-03-05 | 10 | 58 | 22 | 80 | -37.820464 | 144.941268 | 2025-03-05 10:00:00 | 604120 | ... | 0 | 1.23 | 148.7 | 2.93 | 2.07 | 101.68 | 86 | 20.5 | None | None |
5 rows × 37 columns
Cleaning Model_df¶
This is the final removal of all features that do not provide valuable insight for predicting pedestrian count. It removes unnecessary data and cleans the dataset.
model_df.drop(['direction_1','direction_2', 'rtc', 'solarpanel', 'command', 'solar', 'precipitation', 'strikes', 'time','latitude_x', 'longitude_x', 'latitude_y', 'longitude_y', 'sensor_id', 'venue_name' ], axis = 1, inplace=True)
model_df.head()
| location_id | sensing_date | hourday | pedestriancount | date_time | venue_id | distance_venue | sport | wedding | bookable | ... | training | dog_prohib | battery | windspeed | winddirection | gustspeed | vapourpressure | atmosphericpressure | relativehumidity | airtemp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130 | 2025-03-03 | 14 | 79 | 2025-03-03 14:00:00 | 604120 | 197.215163 | 0 | 0 | 1 | ... | 0 | 0 | 4.172 | 0.93 | 181.5 | 2.41 | 1.34 | 101.60 | 77 | 15.4 |
| 1 | 130 | 2025-03-03 | 14 | 79 | 2025-03-03 14:00:00 | 611697 | 332.649732 | 0 | 0 | 1 | ... | 0 | 0 | 4.172 | 0.93 | 181.5 | 2.41 | 1.34 | 101.60 | 77 | 15.4 |
| 2 | 130 | 2025-03-03 | 14 | 79 | 2025-03-03 14:00:00 | 651910 | 49.092892 | 0 | 0 | 1 | ... | 0 | 0 | 4.172 | 0.93 | 181.5 | 2.41 | 1.34 | 101.60 | 77 | 15.4 |
| 3 | 130 | 2025-03-05 | 10 | 80 | 2025-03-05 10:00:00 | 604120 | 197.215163 | 0 | 0 | 1 | ... | 0 | 0 | 4.183 | 1.30 | 154.4 | 3.01 | 2.06 | 101.69 | 86 | 20.4 |
| 4 | 130 | 2025-03-05 | 10 | 80 | 2025-03-05 10:00:00 | 604120 | 197.215163 | 0 | 0 | 1 | ... | 0 | 0 | 4.184 | 1.23 | 148.7 | 2.93 | 2.07 | 101.68 | 86 | 20.5 |
5 rows × 22 columns
Incorporating day of week¶
Given the impact that date and time have on pedestrian count, as seen earlier, a day_of_week feature will let the model learn this relationship and help improve its accuracy.
model_df['day_of_week'] = model_df['sensing_date'].dt.dayofweek
print(model_df.columns)
Index(['location_id', 'sensing_date', 'hourday', 'pedestriancount',
'date_time', 'venue_id', 'distance_venue', 'sport', 'wedding',
'bookable', 'no_smoking', 'event', 'training', 'dog_prohib', 'battery',
'windspeed', 'winddirection', 'gustspeed', 'vapourpressure',
'atmosphericpressure', 'relativehumidity', 'airtemp', 'day_of_week'],
dtype='object')
model_df.head()
| location_id | sensing_date | hourday | pedestriancount | date_time | venue_id | distance_venue | sport | wedding | bookable | ... | dog_prohib | battery | windspeed | winddirection | gustspeed | vapourpressure | atmosphericpressure | relativehumidity | airtemp | day_of_week | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130 | 2025-03-03 | 14 | 79 | 2025-03-03 14:00:00 | 604120 | 197.215163 | 0 | 0 | 1 | ... | 0 | 4.172 | 0.93 | 181.5 | 2.41 | 1.34 | 101.60 | 77 | 15.4 | 0 |
| 1 | 130 | 2025-03-03 | 14 | 79 | 2025-03-03 14:00:00 | 611697 | 332.649732 | 0 | 0 | 1 | ... | 0 | 4.172 | 0.93 | 181.5 | 2.41 | 1.34 | 101.60 | 77 | 15.4 | 0 |
| 2 | 130 | 2025-03-03 | 14 | 79 | 2025-03-03 14:00:00 | 651910 | 49.092892 | 0 | 0 | 1 | ... | 0 | 4.172 | 0.93 | 181.5 | 2.41 | 1.34 | 101.60 | 77 | 15.4 | 0 |
| 3 | 130 | 2025-03-05 | 10 | 80 | 2025-03-05 10:00:00 | 604120 | 197.215163 | 0 | 0 | 1 | ... | 0 | 4.183 | 1.30 | 154.4 | 3.01 | 2.06 | 101.69 | 86 | 20.4 | 2 |
| 4 | 130 | 2025-03-05 | 10 | 80 | 2025-03-05 10:00:00 | 604120 | 197.215163 | 0 | 0 | 1 | ... | 0 | 4.184 | 1.23 | 148.7 | 2.93 | 2.07 | 101.68 | 86 | 20.5 | 2 |
5 rows × 23 columns
Machine Learning Model¶
After the datasets have been unified and all the key features extracted into the final dataset, the last step is to apply the machine learning models. In this instance, I have applied multiple ML models to compare their performance.
Feature Selection¶
To prepare for applying the ML models, we need to select the final features and the target variable. The features are maintained in a list; the target variable is pedestriancount, as the goal is to predict the pedestrian count at the provided location.
from sklearn.model_selection import train_test_split
import sklearn.preprocessing as preproc
from sklearn.ensemble import RandomForestRegressor
#Features is a list of all the features that the ML model will be trained on
features = [
'distance_venue','sport','wedding','bookable','no_smoking', 'event','training','dog_prohib',
'battery','windspeed','winddirection','gustspeed','vapourpressure','atmosphericpressure','relativehumidity',
'airtemp','day_of_week'
]
#The X variable is the features
X = model_df[features]
# The y variable is the target variable
y = model_df['pedestriancount']
Polynomial Features¶
I have implemented pairwise interaction features. This technique is used in predictive modelling because the combined effect of two features can differ from their individual effects. It allows the model to capture higher-degree relationships between features and better predict the target variable, provided there are sufficient samples.
# pairwise interaction features
X2 = preproc.PolynomialFeatures(degree=2,include_bias=False).fit_transform(X)
# Output the shape of the transformed features
print(X2.shape)
(4707, 170)
We can see that the number of features has significantly increased: from the initial 17 features to 170 after the pairwise interaction features have been applied.
Train Test Split¶
In this step the dataset is split into two parts: the training set, used to train the model, and the testing set, data the model has not seen, used to evaluate its performance. Here test_size is 0.2, meaning 80% of the data is used for training and 20% for testing.
X1_train, X1_test, y_train, y_test = train_test_split(X, y,test_size=0.2, random_state=123)
X2_train, X2_test, _, _ = train_test_split(X2, y, test_size=0.2,random_state=123)
Applying Linear Regression Model¶
1. The Linear Regression model is applied to the initial feature set. It is trained on the original features using X1_train. Once trained with linear_r_model.fit(), the model makes predictions via linear_r_model.predict(X1_test). Its performance is then evaluated with root mean squared error and the R² score.
2. The Linear Regression model is applied to the interaction feature set. It is trained on the interaction features X2 using X2_train. Once trained with linear_r_model.fit(), the model makes predictions via linear_r_model.predict(X2_test). Its performance is then evaluated with root mean squared error and the R² score.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error, r2_score
# Instantiate Linear regression model
linear_r_model = LinearRegression()
# Train the model on original features X1
linear_r_model.fit(X1_train, y_train)
# From the trained model predictions can be made
y_pred = linear_r_model.predict(X1_test)
# Root mean squared error is used to evaluate the performance of the model.
print("RMSE:", root_mean_squared_error(y_test, y_pred))
# The R^2 score is another method used to evaluate the performance of the model.
print("R² Score:", r2_score(y_test, y_pred))
RMSE: 470.3250843207092
R² Score: 0.21154484393837447
Linear Regression Original Feature Analysis¶
Root Mean Squared Error (RMSE) measures the model's predictive error. It is computed by taking the square root of the average squared difference between the model's predicted values and the actual values. Here the model obtained an RMSE of 470.3, indicating its predictions are on average 470.3 counts off the actual pedestrian count. This is quite far off, as pedestrian count varies from 0 to 3500, with most values in the range 0-1000. The R² value indicates the proportion of variance in the target variable that is explained by the features in a regression model, measuring goodness of fit; values range from 0 to 1. The result obtained was an R² score of 0.21, indicating the model does not explain much of the variance. The linear regression model is not able to accurately capture the relationships between the features and the target variable, pedestrian count.
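Both metrics can be computed directly from their definitions; a minimal sketch with illustrative counts (not taken from the real data):

```python
import numpy as np

# Illustrative actual vs predicted pedestrian counts
y_true = np.array([120.0, 450.0, 80.0, 900.0, 300.0])
y_hat  = np.array([150.0, 400.0, 100.0, 700.0, 350.0])

# RMSE: square root of the mean squared prediction error
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(round(rmse, 2), round(r2, 3))
```

These hand-rolled values match sklearn's root_mean_squared_error and r2_score on the same arrays.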
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error, r2_score
# Instantiate Linear regression model
linear_r_model = LinearRegression()
# Train the model on interaction features X2
linear_r_model.fit(X2_train, y_train)
# From the trained model predictions can be made
y_pred = linear_r_model.predict(X2_test)
# Root mean squared error is used to evaluate the performance of the model.
print("RMSE:", root_mean_squared_error(y_test, y_pred))
# The R^2 score is another method used to evaluate the performance of the model.
print("R² Score:", r2_score(y_test, y_pred))
RMSE: 414.47537844617756
R² Score: 0.3876803635445494
Linear Regression Interaction Feature Analysis¶
The Root Mean Squared Error (RMSE) of 414.5 indicates the model's predictions are on average 414.5 counts off the actual pedestrian count. This is still quite far off, though a slight improvement over the standard features. The RMSE should not be taken as the prime indicator, as many of the venue locations being predicted do not have pedestrian count sensors nearby; it can still be used to measure and understand the performance of the model. That said, the RMSE remains quite high, and the interaction features only improved it slightly over the standard features.
The R² score of 0.39 still indicates the model does not explain much of the variance; however, performance has improved markedly over the standard features. With the interaction features the model effectively explains double the variance, a significant improvement.
In summary, the linear regression model with interaction features explains almost double the variance in the data, but the RMSE has only marginally improved.
Applying RandomForest Regression Model¶
1. The Random Forest regression model is applied to the initial feature set. It is trained on the original features using X1_train. Once trained with RF_model.fit(), the model makes predictions via RF_model.predict(X1_test). Its performance is then evaluated with root mean squared error and the R² score.
2. The Random Forest regression model is then applied to the interaction feature set. It is trained on the interaction features X2 using X2_train. Once trained with RF_model.fit(), the model makes predictions via RF_model.predict(X2_test). Its performance is then evaluated with root mean squared error and the R² score.
This allows a comparison between the standard features and the interaction features to determine how each affects the model's performance.
from sklearn.ensemble import RandomForestRegressor
# Instantiate RF Regressor model
RF_model1 = RandomForestRegressor(random_state=42)
# Train the model on original features X1
RF_model1.fit(X1_train, y_train)
# From the trained model predictions can be made
y_pred1 = RF_model1.predict(X1_test)
# Root mean squared error is used to evaluate the performance of the model.
print("RMSE:", root_mean_squared_error(y_test, y_pred1))
# The R^2 score is another method used to evaluate the performance of the model.
print("R² Score:", r2_score(y_test, y_pred1))
RMSE: 232.55353797737882
R² Score: 0.8072356955112175
Random Forest Regression Original Feature Analysis¶
Random Forest regression is an ensemble learning method that combines predictions from multiple decision trees to produce more accurate values. This model was chosen for its strong performance and, in particular, its ability to handle complex non-linear relationships, which was a key factor in the decision.
The RMSE of 232.55 is an excellent result given the value range of 0-3500; performance has taken a marked jump compared with the linear regression model on both standard and interaction features, roughly halving the RMSE. The RMSE is only an indicator of performance: a value close to 0 would be impossible, as venue attendance and booking-time data do not exist, so the model is merely providing its best estimate with the available features.
The R² score of 0.81 indicates a good fit, especially given the lack of venue time and attendance data. The model can explain most of the variance, a significant increase in performance over the linear regression model on both feature sets.
The Random Forest model with standard features explains more than double the variance compared with linear regression on both performance metrics, indicating it can be used to predict the pedestrian count at any venue within Melbourne with reasonable accuracy.
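One advantage of Random Forests not used above is their built-in impurity-based feature importances, which show which inputs the model leans on most. A minimal sketch on synthetic stand-in data (column names mirror the real features; the values and the dominance of airtemp are illustrative assumptions, not results from the real data):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: target driven mostly by 'airtemp', weakly by 'windspeed'
rng = np.random.default_rng(42)
n = 300
X_demo = pd.DataFrame({
    'airtemp': rng.normal(18, 5, n),
    'windspeed': rng.normal(2, 0.5, n),
    'day_of_week': rng.integers(0, 7, n).astype(float),
})
y_demo = 50 * X_demo['airtemp'] + 5 * X_demo['windspeed'] + rng.normal(0, 20, n)

rf = RandomForestRegressor(random_state=42).fit(X_demo, y_demo)

# Rank features by impurity-based importance (the scores sum to 1)
importances = pd.Series(rf.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

The same two lines applied to RF_model1 with the features list above would rank the 17 real inputs by how much the trained model relies on each.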
# Instantiate RF Regressor model
RF_model2 = RandomForestRegressor(random_state=42)
# Train the model on interaction features X2
RF_model2.fit(X2_train, y_train)
# From the trained model predictions can be made
y_pred2 = RF_model2.predict(X2_test)
# Root mean squared error is used to evaluate the performance of the model.
print("RMSE:", root_mean_squared_error(y_test, y_pred2))
# The R^2 score is another method used to evaluate the performance of the model.
print("R² Score:", r2_score(y_test, y_pred2))
RMSE: 235.5909624006447
R² Score: 0.8021673510884505
Random Forest Regression Interaction Feature Analysis¶
The RMSE of 235.59 is an excellent result given the value range of 0-3500, again a marked jump from the linear regression model on both feature sets. Comparing against the standard features, however, performance is marginally worse with interaction features, which may be a result of noise in the data: the large increase in the number of features can hurt performance, though here only marginally.
The R² score of 0.80 indicates a good fit, especially given the lack of venue time and attendance data. The model explains most of the variance and is a significant increase in performance over linear regression on both feature sets. As with RMSE, there is a marginal decrease in performance compared with the standard features.
The Random Forest model with interaction features explains more than double the variance compared with linear regression in both cases. However, since its performance is marginally worse than with the standard features, the standard features are the better choice for the Random Forest model.
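The four runs can be summarised side by side using the RMSE and R² values reported above (rounded):

```python
import pandas as pd

# Results reported in the sections above (rounded)
results = pd.DataFrame({
    'model':    ['Linear Regression', 'Linear Regression', 'Random Forest', 'Random Forest'],
    'features': ['standard', 'interaction', 'standard', 'interaction'],
    'RMSE':     [470.33, 414.48, 232.55, 235.59],
    'R2':       [0.212, 0.388, 0.807, 0.802],
})
# Sort so the best-performing (lowest-RMSE) run appears first
print(results.sort_values('RMSE'))
```

Laying the runs out in one table makes the headline finding explicit: Random Forest with standard features is the best of the four configurations.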
Conclusion: AI Pedestrian Flow Prediction¶
The AI Pedestrian Flow Prediction provides a data-driven solution for determining pedestrian traffic at venue locations throughout the city. By combining multiple datasets from the Melbourne Open Data portal, I was able to capture the relationship between many of the features and pedestrian count, and thereby effectively predict the pedestrian traffic at a particular venue.
Throughout the feature engineering process, I used several methods to understand and capture the relationship between the pedestrian count and the features. For the weather data in particular, I was able to capture the relationships that exist between different weather metrics. Unfortunately, I could not use the precipitation data, as the dataset contained no recorded values; I attempted to mitigate this by using metrics such as atmospheric pressure, vapour pressure, humidity, and air temperature, since rainfall can be predicted from these features. Time of day was also a significant indicator of pedestrian flow, so I incorporated features such as day of week to help the model capture further relationships between days and foot traffic. Since real-world pedestrian traffic varies day to day, and more significantly between weekdays and weekends, capturing this relationship was important and played a key role in the performance obtained.
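The day-of-week and weekend features described above can be derived directly from a timestamp column with pandas. The sketch below uses a hypothetical column name (`sensing_datetime`) and a small synthetic frame; the real dataset's timestamp column would take its place.

```python
import pandas as pd

# Hypothetical minimal frame standing in for the merged sensor data
df = pd.DataFrame({
    "sensing_datetime": pd.date_range("2023-03-01", periods=6, freq="D")
})

# Monday = 0 ... Sunday = 6
df["day_of_week"] = df["sensing_datetime"].dt.dayofweek
# Flag Saturdays and Sundays, where foot traffic patterns differ
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
# Hour of day, a strong predictor of pedestrian flow
df["hour"] = df["sensing_datetime"].dt.hour

print(df)
```

Encoding the calendar this way lets a tree-based model split on weekday/weekend boundaries without needing to learn them from raw timestamps.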
Once I had determined which features were strong predictors, I merged the key features into the model dataframe and applied predictive models. As shown above, the Random Forest regression model with standard features outperformed all the other models, in particular the linear regression model. The main limitations were in the venue dataset, which had a number of missing values and lacked historical data such as the times venues are booked, the type of booking made, and the attendance. With this additional data, model performance would improve greatly, and the model could be used with confidence when planning for or estimating foot traffic at specific venues.
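The gap between the linear and Random Forest models reflects the nonlinear structure of pedestrian flow (e.g. daily cycles). A minimal sketch of the comparison workflow on synthetic data, assuming a single hour-of-day feature with a nonlinear effect, illustrates why a tree ensemble pulls ahead:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic hour-of-day feature and a nonlinear daily cycle in counts
X = rng.uniform(0, 24, (1000, 1))
y = 100 * np.sin(X[:, 0] / 24 * 2 * np.pi) ** 2 + rng.normal(0, 5, 1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A straight line cannot follow the cycle; the forest can
lin = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)

lin_r2 = r2_score(y_test, lin.predict(X_test))
rf_r2 = r2_score(y_test, rf.predict(X_test))
print("Linear R²:", lin_r2)
print("Forest R²:", rf_r2)
```

On data like this the linear model's R² collapses while the forest's stays high, mirroring the pattern seen in the project's results.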
The model serves as a utility for city planners to accurately determine the predicted pedestrian traffic at any venue location within the city. This information is crucial for providing city planners and emergency services with the insight needed to manage pedestrian movement and reduce overcrowding.